This is the simplest way to pass data to the indexer. It was created to work around a limitation of xmlpipe2: with xmlpipe2, indexer must map each attribute and field tag in the XML file to a corresponding schema element, and this mapping takes time that grows with the number of fields and attributes in the schema. tsvpipe has no such overhead, since each field and attribute is simply a column in the TSV file, so in some cases tsvpipe can work slightly faster than xmlpipe2.
The first column in a TSV/CSV file must be the document ID. The remaining columns must mirror the declaration of fields and attributes in the schema definition. Note that you don't need to declare the document ID in the schema, since it is always assumed to be present: it must be in the 1st column and must be a unique, signed, positive, non-zero 64-bit integer.
The difference between tsvpipe and csvpipe is in the delimiter and quoting rules. tsvpipe uses the tab character as a hardcoded delimiter and has no quoting rules. csvpipe has the csvpipe_delimiter option for setting the delimiter (default value ,) and also applies quoting rules:
- any field may be quoted
- fields containing a line break, double quote, or comma must be quoted
- a double quote character inside a field must be represented by two double quote characters
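For illustration, a hypothetical CSV row whose second column contains both a comma and a double quote would be written like this (the whole field is quoted and the inner quote is doubled; the values are made up for this example):
1,"The ""Best"" Band, UK","35,23"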
The tsvpipe_command directive is mandatory and contains the shell command invoked to produce the TSV stream that gets indexed. The command can simply read a TSV file, but it can also be a program that generates the tab-delimited content on the fly.
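For example, instead of reading a static file, the command could query a database directly: the mysql client in --batch mode prints tab-separated rows, and --skip-column-names suppresses the header row. The database, table, and column names below are assumptions for illustration only:
tsvpipe_command = mysql -h localhost -u user -pPASSWORD mydb --batch --skip-column-names -e "SELECT id, name, genre_tags FROM bands"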
The following directives can be used to declare the types of the indexed columns:
- tsvpipe_field - declares a text field
- tsvpipe_field_string - declares a text field/string attribute; the column is indexed as a text field and also stored as a string attribute
- tsvpipe_attr_uint - declares an integer attribute
- tsvpipe_attr_timestamp - declares a timestamp attribute
- tsvpipe_attr_bool - declares a boolean attribute
- tsvpipe_attr_float - declares a float attribute
- tsvpipe_attr_bigint - declares a big integer attribute
- tsvpipe_attr_multi - declares a multi-value attribute with integers
- tsvpipe_attr_multi_64 - declares a multi-value attribute with 64-bit integers
- tsvpipe_attr_string - declares a string attribute
- tsvpipe_attr_json - declares a JSON attribute
Example of a source using a TSV file
source tsv_test
{
type = tsvpipe
tsvpipe_command = cat /tmp/rock_bands.tsv
tsvpipe_field = name
tsvpipe_attr_multi = genre_tags
}
And the corresponding /tmp/rock_bands.tsv:
1 Led Zeppelin 35,23,16
2 Deep Purple 35,92
3 Frank Zappa 35,23,16,92,33,24
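To actually build an index from this source, it would typically be referenced from a plain table definition and processed with indexer; the table name and path below are placeholders chosen for this sketch:
table tsv_test_table {
path = /var/lib/manticore/tsv_test
source = tsv_test
}
sudo -u manticore indexer tsv_test_table --rotate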
The csvpipe_command directive is mandatory and contains the shell command invoked to produce the CSV stream that gets indexed. The command can simply read a CSV file, but it can also be a program that generates the comma-delimited content on the fly.
The following directives can be used to declare the types of the indexed columns:
- csvpipe_field - declares a text field
- csvpipe_field_string - declares a text field/string attribute; the column is indexed as a text field and also stored as a string attribute
- csvpipe_attr_uint - declares an integer attribute
- csvpipe_attr_timestamp - declares a timestamp attribute
- csvpipe_attr_bool - declares a boolean attribute
- csvpipe_attr_float - declares a float attribute
- csvpipe_attr_bigint - declares a big integer attribute
- csvpipe_attr_multi - declares a multi-value attribute with integers
- csvpipe_attr_multi_64 - declares a multi-value attribute with 64-bit integers
- csvpipe_attr_string - declares a string attribute
- csvpipe_attr_json - declares a JSON attribute
Example of a source using a CSV file
source csv_test
{
type = csvpipe
csvpipe_command = cat /tmp/rock_bands.csv
csvpipe_field = name
csvpipe_attr_multi = genre_tags
}
1,"Led Zeppelin","35,23,16"
2,"Deep Purple","35,92"
3,"Frank Zappa","35,23,16,92,33,24"
In many situations, the total dataset is too large to be frequently rebuilt from scratch, while the number of new records remains relatively small. For example, a forum may have 1,000,000 archived posts but only receive 1,000 new posts per day.
In such cases, implementing "live" (nearly real-time) table updates can be achieved using a "main+delta" scheme.
The concept involves setting up two sources and two tables, with one "main" table for data that rarely changes (if ever), and one "delta" table for new documents. In the example, the 1,000,000 archived posts would be stored in the main table, while the 1,000 new daily posts would be placed in the delta table. The delta table can then be rebuilt frequently, making the documents available for searching within seconds or minutes. Determining which documents belong to which table and rebuilding the main table can be fully automated. One approach is to create a counter table that tracks the ID used to split the documents and update it whenever the main table is rebuilt.
Using a timestamp column as the split variable is more effective than using the ID since timestamps can track not only new documents but also modified ones.
For datasets that may contain modified or deleted documents, the delta table should provide a list of affected documents, ensuring they are suppressed and excluded from search queries. This is accomplished using a feature called Kill Lists. The document IDs to be killed can be specified in an auxiliary query defined by sql_query_killlist. The delta table must indicate the target tables for which the kill lists will be applied using the killlist_target directive. The impact of kill lists is permanent on the target table, meaning that even if a search is performed without the delta table, the suppressed documents will not appear in the search results.
Notice how sql_query_pre is overridden in the delta source in the example below. The override must be included explicitly: if it isn't, the REPLACE query would be executed during the delta source's build as well, effectively rendering the scheme useless.
Example:
# in MySQL
CREATE TABLE deltabreaker (
index_name VARCHAR(50) NOT NULL,
created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (index_name)
);
# in manticore.conf
source main {
...
sql_query_pre = REPLACE INTO deltabreaker SET index_name = 'main', created_at = NOW()
sql_query = SELECT id, title, UNIX_TIMESTAMP(updated_at) AS updated FROM documents WHERE deleted=0 AND updated_at >=FROM_UNIXTIME($start) AND updated_at <=FROM_UNIXTIME($end)
sql_query_range = SELECT ( SELECT UNIX_TIMESTAMP(MIN(updated_at)) FROM documents) min, ( SELECT UNIX_TIMESTAMP(created_at)-1 FROM deltabreaker WHERE index_name='main') max
sql_query_post_index = REPLACE INTO deltabreaker SET index_name = 'delta', created_at = (SELECT created_at FROM deltabreaker t WHERE index_name='main')
...
sql_attr_timestamp = updated
}
source delta : main {
sql_query_pre =
sql_query_range = SELECT ( SELECT UNIX_TIMESTAMP(created_at) FROM deltabreaker WHERE index_name='delta') min, UNIX_TIMESTAMP() max
sql_query_killlist = SELECT id FROM documents WHERE updated_at >= (SELECT created_at FROM deltabreaker WHERE index_name='delta')
}
table main {
path = /var/lib/manticore/main
source = main
}
table delta {
path = /var/lib/manticore/delta
source = delta
killlist_target = main:kl
}
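With this configuration, the delta table would typically be rebuilt frequently (for example from cron) and the main table only occasionally; the schedule is up to you, and the commands themselves are the standard indexer invocations:
sudo -u manticore indexer delta --rotate
sudo -u manticore indexer main --rotate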
Adding data from tables
Merging two existing plain tables can be more efficient than indexing the data from scratch, and is desirable in some cases (such as merging the 'main' and 'delta' tables instead of simply rebuilding 'main' in the 'main+delta' partitioning scheme), so indexer has an option to do that. Merging tables is normally faster than rebuilding, but still not instant on huge tables. Basically, it needs to read the contents of both tables once and write the result once. Merging a 100 GB table and a 1 GB table, for example, will result in 202 GB of I/O (but that is still likely less than indexing from scratch would require).
The basic command syntax is as follows:
sudo -u manticore indexer --merge DSTINDEX SRCINDEX [--rotate] [--drop-src]
Unless --drop-src is specified, only the DSTINDEX table will be affected: the contents of SRCINDEX will be merged into it. The --rotate switch is required if DSTINDEX is already being served by searchd.
The typical usage pattern is to merge a smaller update from SRCINDEX into DSTINDEX. Thus, when merging attributes, the values from SRCINDEX will win if duplicate document IDs are encountered. Note, however, that the "old" keywords will not be automatically removed in such cases. For example, if there's a keyword "old" associated with document 123 in DSTINDEX, and a keyword "new" associated with it in SRCINDEX, document 123 will be found by both keywords after the merge. You can supply an explicit condition to remove documents from DSTINDEX to mitigate that; the relevant switch is --merge-dst-range:
sudo -u manticore indexer --merge main delta --merge-dst-range deleted 0 0
This switch lets you apply filters to the destination table along with merging. There can be several filters; all of their conditions must be met in order to include the document in the resulting merged table. In the example above, the filter passes only those records where 'deleted' is 0, eliminating all records that were flagged as deleted.
--drop-src allows dropping SRCINDEX after the merge and before rotating the tables. This is important if you specify DSTINDEX in the killlist_target of SRCINDEX; otherwise, when rotating the tables, the documents that have been merged into DSTINDEX may be suppressed by SRCINDEX.
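Putting it together with the main+delta example above, a typical merge run for that scheme could look like this (a sketch using the table names from the earlier example):
sudo -u manticore indexer --merge main delta --rotate --drop-src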